42 research outputs found

Online estimation of the geometric median in Hilbert spaces : non asymptotic confidence balls

Authors: Cardot Hervé, Cénac Peggy, Godichon Antoine
Publication date: 27/01/2015

Estimation procedures based on recursive algorithms are interesting and powerful techniques that are able to deal rapidly with (very) large samples of high-dimensional data. The collected data may be contaminated by noise, so that robust location indicators, such as the geometric median, may be preferred to the mean. In this context, an estimator of the geometric median based on a fast and efficient averaged nonlinear stochastic gradient algorithm has been developed by Cardot, Cénac and Zitt (2013). This work aims at studying more precisely the non-asymptotic behavior of this algorithm by giving non-asymptotic confidence balls. This new result is based on the derivation of improved $L^2$ rates of convergence as well as an exponential inequality for the martingale terms of the recursive nonlinear Robbins-Monro algorithm.
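The averaged stochastic gradient idea mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `averaged_sgd_median`, the zero starting point, and the step-size schedule `c_gamma * n ** (-alpha)` with its default constants are assumptions chosen for the example.

```python
import math
import random


def averaged_sgd_median(stream, dim, c_gamma=1.0, alpha=0.66):
    """Estimate the geometric median of a data stream (illustrative sketch).

    z is the Robbins-Monro iterate; z_bar is the running average of the iterates.
    The step size c_gamma * n**(-alpha) is an assumed schedule, not taken from the source.
    """
    z = [0.0] * dim       # current stochastic gradient iterate (assumed start at 0)
    z_bar = [0.0] * dim   # average of the iterates seen so far
    for n, x in enumerate(stream, start=1):
        diff = [xi - zi for xi, zi in zip(x, z)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0.0:
            gamma = c_gamma * n ** (-alpha)
            # Move a distance gamma along the unit vector pointing at the new observation.
            z = [zi + gamma * d / norm for zi, d in zip(z, diff)]
        # Recursive update of the mean of z_1, ..., z_n.
        z_bar = [zb + (zi - zb) / n for zb, zi in zip(z_bar, z)]
    return z_bar


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [(1.0 + rng.gauss(0, 1), -2.0 + rng.gauss(0, 1)) for _ in range(20000)]
    print(averaged_sgd_median(sample, dim=2))  # should be close to (1, -2)
```

Each observation is used once and then discarded, which is what makes this kind of recursion suitable for large or sequentially arriving samples.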

arXiv.org e-Print Archive

HAL-uB

HAL - Université de Franche-Comté

A fast and recursive algorithm for clustering large datasets with $k$-medians

Authors: Cardot Hervé, Cénac Peggy, Monnez Jean-Marie
Publication date: 18/10/2011

Clustering large samples of high-dimensional data with fast algorithms is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967), who introduced a sequential version of the $k$-means algorithm, a new class of recursive stochastic gradient algorithms designed for the $k$-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to deal with large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. Particular attention is paid to the averaged versions, which are known to perform better, and a data-driven procedure that allows automatic selection of the value of the descent step is proposed. The performance of the averaged sequential estimator is compared in a simulation study, both in terms of computation speed and accuracy of the estimations, with more classical partitioning techniques such as $k$-means, trimmed $k$-means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated on determining television audience profiles with a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours.

Comment: Under revision for Computational Statistics and Data Analysis
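A sequential $k$-medians update in the spirit of this abstract might look like the sketch below. It is only an illustration under stated assumptions: the function name `online_k_medians`, the per-cluster step size `c_gamma * count ** (-alpha)`, and the way the averaged centers are maintained are all hypothetical choices, and the data-driven step selection described in the abstract is not reproduced.

```python
import math
import random


def online_k_medians(stream, init_centers, c_gamma=1.0, alpha=0.75):
    """One-pass k-medians sketch: move the nearest center toward each new point.

    Returns the averaged centers. Step sizes and averaging scheme are assumptions.
    """
    centers = [list(c) for c in init_centers]   # stochastic gradient iterates
    averaged = [list(c) for c in init_centers]  # per-cluster averages of the iterates
    counts = [0] * len(centers)                 # points assigned to each cluster so far
    for x in stream:
        # Assign the observation to its nearest current center.
        r = min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
        norm = math.dist(x, centers[r])
        counts[r] += 1
        if norm > 0.0:
            gamma = c_gamma * counts[r] ** (-alpha)
            # Unit-direction step: the gradient of the distance, not of its square.
            centers[r] = [c + gamma * (xi - c) / norm for c, xi in zip(centers[r], x)]
        # Running mean of the iterates of cluster r.
        averaged[r] = [a + (c - a) / counts[r] for a, c in zip(averaged[r], centers[r])]
    return averaged


if __name__ == "__main__":
    rng = random.Random(0)
    points = []
    for _ in range(2000):
        points.append((rng.gauss(0, 1), rng.gauss(0, 1)))
        points.append((10 + rng.gauss(0, 1), 10 + rng.gauss(0, 1)))
    print(online_k_medians(points, [(1.0, 1.0), (9.0, 9.0)]))
```

Only the nearest center is updated per observation, so the cost per point is proportional to the number of clusters and no data need to be stored.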

arXiv.org e-Print Archive

HAL-uB

HAL - Université de Franche-Comté

INRIA a CCSD electronic archive server

Digital search trees and chaos game representation

Authors: Chauvin Brigitte, Cénac Peggy, Ginouillac Stéphane, Pouyanne Nicolas
Publication date: 01/01/2006

In this paper, we consider a possible representation of a DNA sequence in a quaternary tree, in which one can visualize repetitions of subwords. The CGR-tree turns a sequence of letters into a digital search tree (DST), obtained from the suffixes of the reversed sequence. Several results are known concerning the height and the insertion depth for DSTs built from i.i.d. successive sequences. Here, the successively inserted words are strongly dependent. We give the asymptotic behaviour of the insertion depth and of the length of branches for the CGR-tree obtained from the suffixes of a reversed i.i.d. or Markovian sequence. This behaviour turns out to be, at first order, the same as in the case of independent words. As a by-product, asymptotic results on the length of the longest runs in a Markovian sequence are obtained.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Variable length Markov chains and dynamical sources

Authors: Chauvin Brigitte, Cénac Peggy, Paccaut Frédéric, Pouyanne Nicolas
Publication date: 16/07/2010

Infinite random sequences of letters can be viewed as stochastic chains or as strings produced by a source, in the sense of information theory. The relationship between Variable Length Markov Chains (VLMC) and probabilistic dynamical sources is studied. We establish a probabilistic frame for context trees and VLMC and we prove that any VLMC is a dynamical source for which we explicitly build the mapping. On two examples, the "comb" and the "bamboo blossom", we find a necessary and sufficient condition for the existence and the uniqueness of a stationary probability measure for the VLMC. These two examples are detailed in order to provide the associated Dirichlet series as well as the generating functions of word occurrences.

Comment: 45 pages, 15 figures

arXiv.org e-Print Archive

HAL-uB

HAL - Université de Franche-Comté

Hal-Diderot

HAL UVSQ

Dynamical Systems in the Analysis of Biological Sequences

Authors: Cénac Peggy, Fayolle Guy, Lasgouttes Jean-Marc
Publication venue: HAL CCSD
Publication date: 01/01/2004

The Chaos Game Representation (CGR) maps a sequence of letters taken from a finite alphabet onto the unit square in $\mathbb{R}^2$. While it is a popular tool, few mathematical results have been proved to date. In this report, we show that the CGR gives rise to a limit measure, assuming only that the input sequence is stationary ergodic. Some more precise properties are given in the i.i.d. and Markov cases. A new family of statistical tests to characterize the randomness of the inputs is proposed and analyzed. Finally, some basic properties of the CGR are used to generalize the notion of genomic signature.
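For readers unfamiliar with the CGR, the mapping can be sketched in a few lines. The version below assumes the usual four-letter DNA alphabet, with each letter attached to one corner of the unit square and each new point placed halfway between the previous point and the corner of the current letter. The function name `cgr_points`, the particular corner assignment, and the starting point at the center are assumptions made for illustration.

```python
def cgr_points(sequence, corners=None, start=(0.5, 0.5)):
    """Return the CGR points of a sequence of letters (illustrative sketch).

    The default corner assignment A, C, G, T is one common convention and is
    an assumption here; any one-to-one assignment of letters to corners works.
    """
    if corners is None:
        corners = {
            "A": (0.0, 0.0),
            "C": (0.0, 1.0),
            "G": (1.0, 1.0),
            "T": (1.0, 0.0),
        }
    x, y = start
    points = []
    for letter in sequence:
        cx, cy = corners[letter]
        # Midpoint between the previous point and the corner of the current letter.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    print(cgr_points("AG"))  # [(0.25, 0.25), (0.625, 0.625)]
```

Because each step halves the distance to a corner, the last letters of the sequence determine the sub-square in which the current point lies, which is why repeated subwords show up as dense regions in the picture.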

INRIA a CCSD electronic archive server

Persistent random walks, variable length Markov chains and piecewise deterministic Markov processes

Authors: Chauvin Brigitte, Cénac Peggy, Herrmann Samuel, Vallois Pierre
Publication venue: Polymat Publishing Company
Publication date: 01/01/2012

A classical random walk $(S_t, t \in \mathbb{N})$ is defined by $S_t := \sum_{n=0}^{t} X_n$, where the $(X_n)$ are i.i.d. When the increments $(X_n)_{n \in \mathbb{N}}$ form a first-order Markov chain, a short memory is introduced in the dynamics of $(S_t)$. This so-called "persistent" random walk is no longer Markovian and, under suitable conditions, the rescaled process converges towards the integrated telegraph noise (ITN) as the time-scale and space-scale parameters tend to zero (see [11, 17, 18]). The ITN process is effectively non-Markovian too. The aim is to consider persistent random walks $(S_t)$ whose increments are Markov chains with variable order, which can be infinite. This variable memory is made explicit by a one-to-one correspondence between $(X_n)$ and a suitable Variable Length Markov Chain (VLMC), since for a VLMC the dependency on the past can be unbounded. The key fact is to consider the non-Markovian letter process $(X_n)$ as the marginal of a pair $(X_n, M_n)_{n \ge 0}$, where $(M_n)_{n \ge 0}$ stands for the memory of the process $(X_n)$. We prove that, under a suitable rescaling, $(S_n, X_n, M_n)$ converges in distribution towards a time-continuous process $(S^0(t), X(t), M(t))$. The process $(S^0(t))$ is a semi-Markov and Piecewise Deterministic Markov Process whose paths are piecewise linear.
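The simplest case described at the start of this abstract, a walk whose increments form a first-order Markov chain, can be simulated as in the sketch below. It assumes increments in {-1, +1} and a single parameter `p_keep` giving the probability of repeating the previous increment; both the function name `persistent_walk` and this parametrization are illustrative assumptions, and the variable-length memory studied in the paper is not modelled.

```python
import random


def persistent_walk(steps, p_keep=0.9, seed=None):
    """Simulate S_t = X_0 + ... + X_t with Markovian +/-1 increments (sketch).

    p_keep is the assumed probability that X_{n+1} equals X_n.
    Returns the list [S_0, S_1, ..., S_steps].
    """
    rng = random.Random(seed)
    x = rng.choice((-1, 1))  # X_0, drawn uniformly
    s = x                    # S_0 = X_0
    path = [s]
    for _ in range(steps):
        # Flip the direction with probability 1 - p_keep, otherwise persist.
        if rng.random() >= p_keep:
            x = -x
        s += x
        path.append(s)
    return path


if __name__ == "__main__":
    print(persistent_walk(20, p_keep=0.9, seed=1))
```

With `p_keep` close to 1 the path consists of long straight runs separated by rare turns, which gives some intuition for the piecewise linear limit mentioned in the abstract.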

arXiv.org e-Print Archive

HAL-uB

CiteSeerX

HAL - Université de Franche-Comté

INRIA a CCSD electronic archive server

HAL UVSQ

Digital search trees and chaos game representation

Authors: Chauvin Brigitte, Cénac Peggy, Ginouillac Stéphane, Pouyanne Nicolas
Publication venue: EDP Sciences
Publication date: 01/01/2009

Preliminary version (2006) of a work published in final form (2009).

In this paper, we consider a possible representation of a DNA sequence in a quaternary tree, in which one can visualize repetitions of subwords. The CGR-tree turns a sequence of letters into a digital search tree (DST), obtained from the suffixes of the reversed sequence. Several results are known concerning the height and the insertion depth for DSTs built from i.i.d. successive sequences. Here, the successively inserted words are strongly dependent. We give the asymptotic behaviour of the insertion depth and of the length of branches for the CGR-tree obtained from the suffixes of a reversed i.i.d. or Markovian sequence. This behaviour turns out to be, at first order, the same as in the case of independent words. As a by-product, asymptotic results on the length of the longest runs in a Markovian sequence are obtained.

Crossref

Scientific Publications of the University of Toulouse II Le Mirail

EDP Sciences OAI-PMH repository (1.2.0)

INRIA a CCSD electronic archive server

Numérisation de Documents Anciens Mathématiques

HAL-INSA Toulouse

HAL UVSQ

On the convergence of moments in the almost sure central limit theorem for stochastic approximation algorithms

Author: Cénac Peggy
Publication venue: EDP Sciences
Publication date: 01/01/2013

Crossref